The purpose of this analysis is to build a prediction model that tells whether a recommendation is positive or negative. We will not try to predict the Score itself, only the positive/negative sentiment of the recommendation.
To do so, we will work on Amazon's recommendation dataset and build a term-document matrix using term frequency–inverse document frequency (tf-idf) weighting. Once the data is ready, we will feed it into predictive algorithms, mainly naïve Bayes and logistic regression.
In the end, we hope to find a "best" model for predicting the recommendation's sentiment.
To load the data, we will use the SQLite database, from which we only fetch the Score and the recommendation summary.
As we only want the overall sentiment of each recommendation (positive or negative), we purposefully ignore all Scores equal to 3. If the Score is above 3, the recommendation is labelled "positive"; otherwise it is labelled "negative".
The data will be split into a training set and a test set, with a test-set ratio of 0.2.
In [1]:
%matplotlib inline
import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
Let's first check whether we have the dataset available:
In [2]:
import os
from IPython.core.display import display, HTML
if not os.path.isfile('database.sqlite'):
    display(HTML("<h3 style='color: red'>Dataset database missing!</h3><h3> Please download it "+
                 "<a href='https://www.kaggle.com/snap/amazon-fine-food-reviews'>from here on Kaggle</a> "+
                 "and extract it to the current directory.</h3>"))
    raise(Exception("missing dataset"))
In [3]:
con = sqlite3.connect('database.sqlite')
pd.read_sql_query("SELECT * FROM Reviews LIMIT 3", con)
Out[3]:
Let's select only what's of interest to us:
In [4]:
messages = pd.read_sql_query("""
SELECT
Score,
Summary,
HelpfulnessNumerator as VotesHelpful,
HelpfulnessDenominator as VotesTotal
FROM Reviews
WHERE Score != 3""", con)
Let's see what we've got:
In [5]:
messages.head(5)
Out[5]:
Let's add the Sentiment column that turns the numeric score into either positive or negative.
In [6]:
messages["Sentiment"] = messages["Score"].apply(lambda score: "positive" if score > 3 else "negative")
messages.head(2)
Out[6]:
Similarly, the Usefulness column turns the vote counts into either useful or useless, using the formula (VotesHelpful / VotesTotal) > 0.8.
In [7]:
messages["Usefulness"] = TODO
messages.head(2)
Let's have a look at some 5s:
In [ ]:
messages[messages.Score == 5].head(10)
And some 1s as well:
In [ ]:
# one possible solution: select some reviews with score 1
messages[messages.Score == 1].head(10)
In [ ]:
from wordcloud import WordCloud, STOPWORDS
# Note: you need to install wordcloud with pip.
# On Windows, you might need a binary package obtainable from here: http://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud
stopwords = set(STOPWORDS)
#mpl.rcParams['figure.figsize']=(8.0,6.0) #(6.0,4.0)
mpl.rcParams['font.size']=12 #10
mpl.rcParams['savefig.dpi']=100 #72
mpl.rcParams['figure.subplot.bottom']=.1
def show_wordcloud(data, title=None):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=200,
        max_font_size=40,
        scale=3,
        random_state=1  # chosen at random by flipping a coin; it was heads
    ).generate(str(data))
    fig = plt.figure(1, figsize=(8, 8))
    plt.axis('off')
    if title:
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)
    plt.imshow(wordcloud)
    plt.show()

# Summary_Clean is only created further down, so use the raw Summary column here
show_wordcloud(messages["Summary"])
We can also view wordclouds for only positive or only negative entries:
In [ ]:
# one possible solution: word cloud for negative reviews (Summary_Clean is created further down, so use Summary here)
show_wordcloud(messages[messages.Sentiment == "negative"]["Summary"], title="Negative reviews")
In [ ]:
# one possible solution: word cloud for positive reviews
show_wordcloud(messages[messages.Sentiment == "positive"]["Summary"], title="Positive reviews")
scikit-learn cannot work with words directly, so we'll assign a new dimension to each word and work with word counts.
See more here: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
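To make this concrete, here is a small added illustration (not part of the original analysis) of what CountVectorizer produces on a toy corpus: each column is one term (a new dimension), each row one document, and each entry the term's count. TfidfTransformer then reweights these counts (with its defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row) so that terms appearing in almost every summary carry less weight.
In [ ]:
# Added illustration: a term-document count matrix on a toy corpus (not part of the original analysis)
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["great taste", "awful taste", "great smell great taste"]
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_corpus)
print(toy_vect.get_feature_names())  # ['awful', 'great', 'smell', 'taste']
print(toy_counts.toarray())
# [[0 1 0 1]
#  [1 0 0 1]
#  [0 2 1 1]]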
In [ ]:
# first do some cleanup
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import re
import string
cleanup_re = re.compile('[^a-z]+')
def cleanup(sentence):
    sentence = sentence.lower()
    sentence = cleanup_re.sub(' ', sentence).strip()
    return sentence
messages["Summary_Clean"] = messages["Summary"].apply(cleanup)
train, test = train_test_split(messages, test_size=0.2)
print("%d items in training data, %d in test data" % (len(train), len(test)))
In [ ]:
# To remove stop words, add stop_words = STOPWORDS here,
# but the results seem better without it (see the note before the ROC plot below)
count_vect = CountVectorizer(min_df = 1, ngram_range = (1, 4))
X_train_counts = count_vect.fit_transform(train["Summary_Clean"])
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_test_counts = count_vect.transform(test["Summary_Clean"])
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
y_train = train["Sentiment"]
y_test = test["Sentiment"]
# prepare
prediction = dict()
In [ ]:
word_features = count_vect.get_feature_names()
word_features[10000:10010]
In [ ]:
chosen_word_idx = 99766
chosen_word_indices = np.nonzero(X_train_counts[:,chosen_word_idx].toarray().ravel())[0]
for i in chosen_word_indices[0:10]:
    print("'%s' appears %d times in: %s" % (
        word_features[chosen_word_idx],
        X_train_counts[i, chosen_word_idx],
        train["Summary"].values[i]
    ))
In [ ]:
#TODO: find the counts for "gluten" and the reviews it appears in
In [ ]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB().fit(X_train_tfidf, y_train)
prediction['Multinomial'] = model.predict(X_test_tfidf)
In [ ]:
from sklearn.naive_bayes import BernoulliNB
model = BernoulliNB().fit(X_train_tfidf, y_train)
prediction['Bernoulli'] = model.predict(X_test_tfidf)
In [ ]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e5)
logreg_result = logreg.fit(X_train_tfidf, y_train)
prediction['Logistic'] = logreg.predict(X_test_tfidf)
In [ ]:
from sklearn.svm import LinearSVC
linsvc = LinearSVC(C=1e5)
linsvc_result = linsvc.fit(X_train_tfidf, y_train)
prediction['LinearSVC'] = linsvc.predict(X_test_tfidf)
Before analyzing the results, let's remember what Precision and Recall are (more here https://en.wikipedia.org/wiki/Precision_and_recall)
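As a quick added reminder (generic definitions, nothing specific to this notebook): taking "positive" as the class of interest, precision = TP / (TP + FP) is the fraction of reviews predicted positive that really are positive, and recall = TP / (TP + FN) is the fraction of truly positive reviews that the model actually catches. A minimal sketch with scikit-learn:
In [ ]:
# Added sketch: precision and recall on a tiny hand-made example (not part of the original analysis)
from sklearn.metrics import precision_score, recall_score

y_true_demo = ["positive", "positive", "negative", "positive", "negative"]
y_pred_demo = ["positive", "negative", "negative", "positive", "positive"]
print(precision_score(y_true_demo, y_pred_demo, pos_label="positive"))  # 2 TP / (2 TP + 1 FP) = 0.67
print(recall_score(y_true_demo, y_pred_demo, pos_label="positive"))     # 2 TP / (2 TP + 1 FN) = 0.67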
In order to compare our learning algorithms, let's build the ROC curve. The curve with the highest AUC value will show our "best" algorithm.
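One caveat worth adding: the ROC cell below is built from the hard class predictions, so each classifier contributes a single operating point joined to the corners rather than a full curve. For a smoother curve you can pass roc_curve the predicted probability of the positive class instead; a sketch for the logistic regression model fitted above (added here, not part of the original analysis):
In [ ]:
# Added sketch: ROC/AUC from predicted probabilities instead of hard labels
# (assumes logreg, X_test_tfidf and y_test from the cells above)
probs = logreg.predict_proba(X_test_tfidf)[:, 1]  # column 1 = probability of the "positive" class
fpr, tpr, _ = roc_curve(y_test.map(lambda s: 1 if s == 'positive' else 0), probs)
print("Logistic regression AUC from probabilities: %.3f" % auc(fpr, tpr))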
In the first data-cleaning pass, stop-word removal was used, but the results were much worse. A likely reason is that when people express whether something is good or not, they use many short words such as "not", which are typically tagged as stop words and therefore removed. This is why the stop words were kept in the end. For those who would like to try it themselves, the stop-word removal is left as a comment in the cleaning part of the analysis.
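Whichever stop-word list is used, negation words tend to be on it; for instance, scikit-learn's built-in English list contains "not" and "no", so enabling stop-word removal strips exactly the tokens that can flip a summary's sentiment. A small added illustration (not part of the original analysis):
In [ ]:
# Added illustration: negations are treated as stop words by scikit-learn's built-in English list
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

print("not" in ENGLISH_STOP_WORDS, "no" in ENGLISH_STOP_WORDS)  # True True
demo_vect = CountVectorizer(stop_words='english')
print(demo_vect.fit(["not good at all"]).get_feature_names())   # only ['good'] survives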
In [ ]:
def formatt(x):
    if x == 'negative':
        return 0
    return 1
vfunc = np.vectorize(formatt)
cmp = 0
colors = ['b', 'g', 'y', 'm', 'k']
for model, predicted in prediction.items():
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test.map(formatt), vfunc(predicted))
    roc_auc = auc(false_positive_rate, true_positive_rate)
    plt.plot(false_positive_rate, true_positive_rate, colors[cmp], label='%s: AUC %0.2f' % (model, roc_auc))
    cmp += 1
plt.title('Classifiers comparison with ROC')
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
After plotting the ROC curves, it appears that logistic regression gives the best results, although its AUC value is not outstanding.
The strongest models look like LogisticRegression and LinearSVC. Let's see the precision, recall and confusion matrix for these models:
In [ ]:
for model_name in ["Logistic", "LinearSVC"]:
    print("Classification report for %s" % model_name)
    # with string labels, scikit-learn orders the classes alphabetically: negative, positive
    print(metrics.classification_report(y_test, prediction[model_name], target_names=["negative", "positive"]))
    print()
In [ ]:
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues, labels=["negative", "positive"]):
    # confusion_matrix orders string labels alphabetically, hence the default of negative, positive
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(labels))
    plt.xticks(tick_marks, labels, rotation=45)
    plt.yticks(tick_marks, labels)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Compute confusion matrix
cm = confusion_matrix(y_test, prediction['Logistic'])
np.set_printoptions(precision=2)
plt.figure()
plot_confusion_matrix(cm)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
plt.figure()
plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')
plt.show()
Let's also have a look at what the best & worst words are by looking at the coefficients:
In [ ]:
words = count_vect.get_feature_names()
feature_coefs = pd.DataFrame(
data = list(zip(words, logreg_result.coef_[0])),
columns = ['feature', 'coef'])
feature_coefs.sort_values(by='coef')
In [ ]:
def test_sample(model, sample):
    sample_counts = count_vect.transform([sample])
    sample_tfidf = tfidf_transformer.transform(sample_counts)
    result = model.predict(sample_tfidf)[0]
    prob = model.predict_proba(sample_tfidf)[0]
    print("Sample estimated as %s: negative prob %f, positive prob %f" % (result.upper(), prob[0], prob[1]))
test_sample(logreg, "The food was delicious, it smelled great and the taste was awesome")
test_sample(logreg, "The whole experience was horrible. The smell was so bad that it literally made me sick.")
test_sample(logreg, "The food was ok, I guess. The smell wasn't very good, but the taste was ok.")
In [ ]:
show_wordcloud(messages[messages.Usefulness == "useful"]["Summary_Clean"], title = "Useful")
show_wordcloud(messages[messages.Usefulness == "useless"]["Summary_Clean"], title = "Useless")
Nothing seems to pop out... let's try limiting the dataset to entries with at least 10 votes.
In [ ]:
messages_ufn = messages[messages.VotesTotal >= 10]
messages_ufn.head()
Now let's try again with the word clouds:
In [ ]:
show_wordcloud(messages_ufn[messages_ufn.Usefulness == "useful"]["Summary_Clean"], title = "Useful")
show_wordcloud(messages_ufn[messages_ufn.Usefulness == "useless"]["Summary_Clean"], title = "Useless")
This seems a bit better. Let's see if we can build a model, though:
In [ ]:
from sklearn.pipeline import Pipeline
train_ufn, test_ufn = train_test_split(messages_ufn, test_size=0.2)
ufn_pipe = Pipeline([
('vect', CountVectorizer(min_df = 1, ngram_range = (1, 4))),
('tfidf', TfidfTransformer()),
('clf', LogisticRegression(C=1e5)),
])
ufn_result = ufn_pipe.fit(train_ufn["Summary_Clean"], train_ufn["Usefulness"])
prediction['Logistic_Usefulness'] = ufn_pipe.predict(test_ufn["Summary_Clean"])
print(metrics.classification_report(test_ufn["Usefulness"], prediction['Logistic_Usefulness']))
Let's also see which of the reviews are rated by our model as most helpful and least helpful:
In [ ]:
ufn_scores = [a[0] for a in ufn_pipe.predict_proba(train_ufn["Summary_Clean"])]  # the pipeline was fitted on Summary_Clean; a[0] is the probability of "useful"
ufn_scores = zip(ufn_scores, train_ufn["Summary"], train_ufn["VotesHelpful"], train_ufn["VotesTotal"])
ufn_scores = sorted(ufn_scores, key=lambda t: t[0], reverse=True)
# just make this into a DataFrame since jupyter renders it nicely:
pd.DataFrame(ufn_scores)
In [ ]:
cm = confusion_matrix(test_ufn["Usefulness"], prediction['Logistic_Usefulness'])
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
np.set_printoptions(precision=2)
plt.figure()
plot_confusion_matrix(cm_normalized, labels=["useful", "useless"])
In [ ]:
from sklearn.pipeline import FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
# Useful to select only certain features in a dataset for forwarding through a pipeline
# See: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html
class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]
train_ufn2, test_ufn2 = train_test_split(messages_ufn, test_size=0.2)
ufn_pipe2 = Pipeline([
('union', FeatureUnion(
transformer_list = [
('summary', Pipeline([
('textsel', ItemSelector(key='Summary_Clean')),
('vect', CountVectorizer(min_df = 1, ngram_range = (1, 4))),
('tfidf', TfidfTransformer())])),
('score', ItemSelector(key=['Score']))
],
transformer_weights = {
'summary': 0.2,
'score': 0.8
}
)),
('model', LogisticRegression(C=1e5))
])
ufn_result2 = ufn_pipe2.fit(train_ufn2, train_ufn2["Usefulness"])
prediction['Logistic_Usefulness2'] = ufn_pipe2.predict(test_ufn2)
print(metrics.classification_report(test_ufn2["Usefulness"], prediction['Logistic_Usefulness2']))
In [ ]:
cm = confusion_matrix(test_ufn2["Usefulness"], prediction['Logistic_Usefulness2'])
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
np.set_printoptions(precision=2)
plt.figure()
plot_confusion_matrix(cm_normalized, labels=["useful", "useless"])
In [ ]:
# total number of coefficients in the final model: one per summary n-gram plus one for Score
len(ufn_result2.named_steps['model'].coef_[0])
Again, let's have a look at the best/worst words:
In [ ]:
ufn_summary_pipe = next(tr[1] for tr in ufn_result2.named_steps["union"].transformer_list if tr[0]=='summary')
ufn_words = ufn_summary_pipe.named_steps['vect'].get_feature_names()
ufn_features = ufn_words + ["Score"]
ufn_feature_coefs = pd.DataFrame(
data = list(zip(ufn_features, ufn_result2.named_steps['model'].coef_[0])),
columns = ['feature', 'coef'])
ufn_feature_coefs.sort_values(by='coef')
In [ ]:
print("And the coefficient of the Score variable: ")
ufn_feature_coefs[ufn_feature_coefs.feature == 'Score']